Methods and apparatus for performing sequence homology detection

ABSTRACT

In a sequence homology detection aspect of the invention, a computer-based method of detecting homologies between a plurality of sequences in a database and a query sequence comprises the following steps. First, the method includes accessing patterns associated with the database, each pattern representing at least a portion of one or more sequences in the database. Next, the query sequence is compared to the patterns to detect whether one or more portions of the query sequence are homologous to portions of the sequences of the database represented by the patterns. Then, a score is generated for each sequence detected to be homologous to the query sequence, wherein the sequence score is based on individual scores generated in accordance with each homologous portion of the sequence detected, and the sequence score represents a degree of homology between the query sequence and the detected sequence.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims priority to provisional U.S. patentapplication No. 60/106,295, filed Oct. 30, 1998, in the names of A.Floratos and I. Rigoutsos, and is related to U.S. patent application,entitled “Methods and Apparatus for Performing Pattern DictionaryFormation for Use in Sequence Homology Detection,” filed concurrentlyherewith, in the names of A. Floratos and I. Rigoutsos, the disclosuresof which are incorporated herein by reference.

FIELD OF THE INVENTION

The present invention relates generally to database searches and, moreparticularly, to methods and apparatus for detecting sequence homologybetween a query sequence and sequences in a database in association witha given application, e.g., genetic research.

BACKGROUND OF THE INVENTION

In the area of genetic research, the first step following the sequencingof a new gene is an effort to identify that gene's function. The mostpopular and straightforward methods to achieve that goal exploit thefollowing biological fact—if two peptide stretches exhibit sufficientsimilarity at the sequence level (i.e., one can be obtained from theother by a small number of insertions, deletions and/or amino acidmutations), then they probably are biologically related. Examples ofsuch an approach are described in A. M. Lesk, “Computational MolecularBiology,” Encyclopedia of Computer Science and Technology; A. Kent andJ. G. Williams editors, 31:101-165, Marcel Dekker, New York, 1994; R. F.Doolittle, “What we have learned and will learn from sequencedatabases,” Computers and DNA, G. Bell and T. Marr editors, 21-31,Addison-Wesley, 1990; C. Caskey, R. Eisenberg, E. Lander, and J. Straus,“Hugo statement on patenting of DNA,” Genome Digest, 2:6-9, 1995; and W.R. Pearson, “Protein sequence comparison and protein evolution,”Tutorial of Intelligent Systems in Molecular Biology, Cambridge,England, 1995.

Within this framework, the question of getting clues about the functionof a new gene becomes one of identifying homologies in strings of aminoacids. Generally, a homology refers to a similarity, likeness, orrelation between two or more sequences or strings. Thus, one is given aquery sequence Q (e.g., the new gene) and a set D of well characterizedproteins and is looking for all regions of Q which are similar toregions of sequences in D.

The first approaches used for realizing this task were based on atechnique known as dynamic programming. This approach is described in S.B. Needleman and C. D. Wunsch, “A General Method Applicable To TheSearch For Similarities In The Amino Acid Sequence Of Two Proteins,”Journal Of Molecular Biology, 48:443-453, 1970; and T. F. Smith and M.S. Waterman, “Identification Of Common Molecular Subsequences,” JournalOf Molecular Biology, 147:195-197, 1981. Unfortunately, thecomputational requirements of this method quickly render it impractical,especially when searching large databases, as is the norm today.Generally, the problem is that dynamic programming variants spend a goodpart of their time computing homologies which eventually turn out to beunimportant.

In an effort to work around this issue, a number of algorithms have beenproposed which focus on discovering only extensive local similarities.The most well known among these algorithms are referred to as FASTA andBLAST. The FASTA algorithm is described in W. R. Pearson, and D. J.Lipman, “Improved tools for biological sequence comparison,” Proc. Natl.Acad. Sci., 85:2444-2448, 1988; and D. J. Lipman, and W. R. Pearson,“Rapid and sensitive protein similarity searches,” Science,227:1435-1441, 1989. The BLAST algorithm is described in S. Altschul, W.Gish, W. Miller, E. W. Myers, and D. Lipman, “A basic local alignmentsearch tool,” J. Mol. Biology, 215:403-410, 1990. In the majority of thecases, increased performance is achieved by first looking for ungappedhomologies, i.e., similarities due exclusively to mutations and notinsertions or deletions. The rationale behind this approach is that inany substantial gapped homology between two peptide strings, chances arethat there exists at least a pair of substrings whose match contains nogaps. The locating of these substrings (the ungapped homology) can thenbe used as the first step towards obtaining the entire (gapped)homology.

Identifying the similar regions between the query and the databasesequences is, however, only the first part (the computationally mostdemanding) of the process. The second part (the one that is of interestto biologists) is evaluating these similarities, i.e., deciding if theyare substantial enough to sustain the inferred relation (functional,structural or otherwise) between the query and the corresponding database sequence(s). Such evaluations are usually performed by combiningbiological information and statistical reasoning. Typically, similarityis quantified as a score computed for every pair of related regions.Computation of this score involves the use of gap costs (for gappedalignments) and of appropriate mutation matrices giving the evolutionaryprobability of any given amino acid changing into another. Examples ofthese matrices are the PAM matrix (see M. O. Dayhoff, R. M. Schwartz andB. C. Orcutt, “A model of evolutionary change in proteins,” Atlas ofProtein Sequence and Structure, 5:345-352, 1978) and the BLOSUM matrix(see S. Henikoff and J. G. Henikoff, “Amino acid substitution matricesfrom protein blocks,” Proc. Natl. Acad. Sci., 89:915-919, 1992). Then,the statistical importance of this cost is evaluated by computing theprobability (under some statistical model) that such a score could arisepurely by chance, e.g., see S. Karlin, A. Dembo and T. Kawabata,“Statistical composition of high-scoring segments from molecularsequences,” The Annals of Statistics, 2:571-581,1990; and S. Karlin andS. Altschul, “Methods for assessing the statistical significance ofmolecular sequence features by using general scoring schemes,” Proc.Natl. Acad. Sci., 87:2264-2268, 1990. Depending on the statistical modelused, this probability can depend on a number of factors such as: thelength of the query sequence, the size of the underlying database, etc.No matter, however, what conventional statistical model one uses thereare always the so called “gray areas,” i.e., situations where astatistically unimportant score indicates really a biologicallyimportant similarity. Unfortunate as this might be, it is alsoinescapable; there is after all a limit to bow well a statistical modelcan approximate the biological reality.

An alternative to the inherent difficulty of attaching statisticalimportance to weak similarities is the use of biological knowledge indeducing sequence descriptors that model evolutionary distanthomologies. BLOCKS (see S. Henikoff and J. Henikoff, “Automatic Assemblyof Protein Blocks for Database Searching,” Nucleic Acids Research,19:6565-6572, 1991) is a system that employs pattern-induced profilesobtained over the protein classification defined in the PROSITE (see S.Henikoff and J. Henikoff, “Protein Family Classification Based onSearching a Database of Blocks,” Genomics, Vol. 19, pp. 97-107, 1994)database in order to functionally annotate new genes. The advantage hereis that this classification is compiled by experts working with familiesof proteins known to be related. As a result, even weak similarities canbe recognized and used in the annotation process. On the other hand,there is only that much knowledge about which proteins are indeedrelated and consequently being representable by a pattern. Furthermore,there is always the danger that a family of proteins actually containsmore members than is currently thought of. By excluding these othermembers from consideration, it is possible to get patterns that “overfit” the family, i.e., they are too strict to extrapolate to theunidentified family members.

Therefore, it is evident that there exists a need for methods andapparatus for creating improved pattern dictionaries through uniquedictionary formation techniques that permit improved sequence homologydetection, as well as a need for methods and apparatus for sequencehomology detection, itself, which are not limited to searching onlyannotated sequences.

SUMMARY OF THE INVENTION

The present invention provides solutions to the above and other needs byproviding improved pattern dictionary formation techniques and improvedsequence homology detection techniques, as will be described in greaterdetail below.

In a sequence homology detection aspect of the invention, acomputer-based method of detecting homologies between a plurality ofsequences in a database and a query sequence comprises the followingsteps. First, the method includes accessing patterns associated with thedatabase, each pattern representing at least a portion of one or moresequences in the database. Next, the query sequence is compared to thepatterns to detect whether one or more portions of the query sequenceare homologous to portions of the sequences of the database representedby the patterns. Then, a score is generated for each sequence detectedto be homologous to the query sequence, wherein the sequence score isbased on individual scores generated in accordance with each homologousportion of the sequence detected, and the sequence score represents adegree of homology between the query sequence and the detected sequence.

In a dictionary formation aspect of the invention, a computer-basedmethod of processing a plurality of sequences in a database comprisesthe following steps. First, the method includes evaluating each of theplurality of sequences including characters which form each sequence.Then, at least one pattern of characters is generated representing atleast a subset of the sequences in the database. The pattern has astatistical significance associated therewith, the statisticalsignificance of the pattern being determined by a value representing aminimum number of sequences that the pattern supports in the database.

Accordingly, in a significant departure from prior art approaches, themethodologies of the invention are based on the unsupervised patterndiscovery performed on arbitrary data bases without requiring any priorpartition of the database. The BLOCKS approach assumes that the databasehas been partitioned (by outside experts) in subsets of biologicallyrelated sequences. Profiles are then obtained by individually processingeach subset. As a result of this approach, BLOCKS cannot handlearbitrary databases, since not all such databases are partitioned inrelated subsets. In fact, BLOCKS works only with the SwissProt database,referenced herein, using the protein groups described in the PROSITEdatabase, also referenced herein. The present invention, on the otherhand, preferably uses the entire database as its input and provides foran automated methodology to decide which patterns are important andwhich are not.

Further, the present invention provides a new statistical framework forevaluating the statistical importance of the discovered patterns. Unlikeexisting frameworks, the approach of the invention introduces theconcept of memory in its computations. That is, for example, when aregion A on the query sequence is compared with a region B on somedatabase sequence, the resulting similarity score is evaluated by takinginto account the similarity of A to all the other sequences in thedatabase.

The use of the enhanced statistical model described herein allows thedetection of important local similarities which, using existingapproaches, would go undetected. This allows the system of the inventionto perform similarity searches at a higher level of sensitivity than ispossible using prior art systems.

Still further, the present invention provides an automated method toutilize partial annotation information available in the underlyingdatabase D. This methodology allows the user to exploit in greaterdetail similarities that seem unimportant. For example, when a patternmatches the query sequence region A, all the database regions alsomatching that pattern can be inspected. If all (or more) of thesedatabase regions are annotated in the same way, then this annotation canbe transferred to the query region A. Partially annotating the querysequence in the above manner can prove useful towards the overallsequence annotation.

The present invention also provides for a detailed methodology tocluster the database into groups of highly homologous sequence. In agenetic data processing application, this methodology allows for thecorrect treatment of multi-domain proteins.

It is also to be appreciated that the inventive concepts describedherein may be implemented on a network such as, for example, theInternet, in a client-server relationship. This allows a user to enter aquery sequence at a client device at a remote location that istransmitted to a server over the network and processed at the server.The server then returns the results of the homology search to the clientdevice of the user via the network.

These and other objects, features and advantages of the presentinvention will become apparent from the following detailed descriptionof illustrative embodiments thereof, which is to be read in connectionwith the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a sequence homology detection systemaccording to an embodiment of the present invention;

FIG. 2 is a block diagram of an exemplary hardware implementation of asequence homology detection system of the present invention;

FIG. 3 is a block diagram of a network-based implementation of asequence homology detection system of the present invention;

FIG. 4 is a high level flow chart illustrating a search enginemethodology according to one embodiment of the present invention;

FIG. 5 is a diagram illustrating an example of a pattern matchingprocess for a given query sequence according to one embodiment of thepresent invention;

FIG. 6 is a diagram illustrating an example of a hash table generatedfor a particular query sequence according to one embodiment of thepresent invention;

FIG. 7 is a diagram illustrating an example of a chaining process for agiven query sequence according to one embodiment of the presentinvention;

FIG. 8 is a diagram illustrating an example of a directed, weightedgraph generated in accordance with a scoring process for a given querysequence according to one embodiment of the present invention;

FIG. 9 is a flow diagram illustrating an embodiment of the matching andchaining phase of a search engine methodology of the present invention;

FIG. 10 is a flow diagram illustrating an embodiment of the scoringphase of a search engine methodology of the present invention;

FIG. 11 is a diagram illustrating a distribution of patterns with givenbackbone structures in SP34 and comparison with the random distributionof the same backbones;

FIGS. 12 through 15 are flow diagrams illustrating a dictionaryformation methodology according to one embodiment of the presentinvention;

FIG. 16 is a flow diagram illustrating a database clean up processaccording to one embodiment of the present invention; and

FIGS. 17 through 30 are diagrams illustrating experimental resultsassociated with the present invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The present invention will be explained below in the context of anillustrative genetic data processing application. However, it is to beunderstood that the present invention is not limited to such aparticular application. Rather, the invention is more generallyapplicable to creating pattern dictionaries from arbitrary databases(after appropriately translating database records to an equivalentsequence representation) and performing unrestricted homology searchesof any given query record against the data in the database.

Referring initially to FIG. 1, a block diagram of a sequence homologydetection system according to an embodiment of the present invention isshown. The illustrative system 100 includes a search engine module 110,a pattern dictionary 120, a dictionary formation module 130 and a sourcedatabase 140. As will be explained in detail below, the search engine 10receives a query sequence from a user and performs a search of thepattern dictionary 120 to attempt to locate patterns stored in thedictionary which represent sequences from the database 140 that aresimilar in some way to the query sequence. Prior to a query, thedictionary formation module 130 creates the pattern dictionary 120 fromthe database 140. This dictionary formation process is referred to asinformation gathering or mining. The search engine 110 returns some orall of the query results (e.g., homologous sequences from the database)to the user.

FIG. 2 is a block diagram of an exemplary hardware implementation of thesequence homology detection system 100. As shown, the system 100 may beimplemented in accordance with a processor 210, a memory 220 and I/Odevices 230. It is to be appreciated that the term “processor” as usedherein is intended to include any processing device, such as, forexample, one that includes a CPU (central processing unit). The term“memory” as used herein is intended to include memory associated with aprocessor or CPU, such as, for example, RAM, ROM, a fixed memory device(e.g., hard drive), a removable memory device (e.g., diskette), flashmemory, etc. In addition, the term “input/output devices” or “I/Odevices” as used herein is intended to include, for example, one or moreinput devices, e.g., keyboard, for making queries and/or inputting datato the processing unit, and/or one or more output devices, e.g., CRTdisplay and/or printer, for presenting query results and/or otherresults associated with the processing unit. It is also to be understoodthat the term “processor” may refer to more than one processing deviceand that various elements associated with a processing device may beshared by other processing devices. Accordingly, software componentsincluding instructions or code for performing the methodologies of theinvention, as described herein, may be stored in one or more of theassociated memory devices (e.g., ROM, fixed or removable memory) and,when ready to be utilized, loaded in part or in whole (e.g., into RAM)and executed by a CPU.

FIG. 3 is a block diagram of a network-based implementation of thesequence homology detection system of the present invention. As shown, aclient computer system 310 is in communication with a server computersystem 320 via a network 330 such as, for example, the Internet.However, the network could also be a private network and/or a localnetwork. According to the implementation of FIG. 3, all or a portion ofthe elements of system 100 shown in FIG. 1 reside on and are executed bythe server 330. A user operating remotely on his client computer system,e.g., a personal computer, laptop and/or some other type of personalprocessing device, enters a query sequence through application softwarerunning on the computer system, e.g., web browsing software and/or agraphical user interface associated with the search engine. The query ispassed over the network 330, in a conventional manner, and processed byserver 320. The server 320 receives the query and executes the searchengine methodologies of the invention in accordance with a storedpattern dictionary. The dictionary may have been formed by a dictionaryformation module of the invention in accordance with a source database.The server returns some or all of the query results (e.g., homologoussequences from the database) to the client via the network. It is to beunderstood that the server may represent more than one computer system.That is, one or more of the elements in FIG. 1 may reside on and beexecuted by their own computer system, e.g., with its own processor,memory and I/O devices.

Given a general description of the elements of the sequence homologydetection system of the invention and various exemplary hardwareimplementations, the various inventive methodologies will now beexplained in detail.

The respective methodologies associated with the search engine module110 and the dictionary formation module 130 will be described belowtogether in an illustrative embodiment associated with the sequencehomology detection system 100. However, it is to be understood that theinventive methodology associated with the search engine module may beutilized with other known pattern dictionaries. Likewise, the inventivemethodology associated with the dictionary formation module may beemployed to create pattern dictionaries for use with other known searchengines.

For ease of reference, the remainder of detailed description will bedivided into sections as follows: (I) Definitions; (II) Search Engine;(III) Dictionary Formation; and (IV) Experimental Results.

I. Definitions

The following section provides some notation that is used below todescribe various aspects of the invention.

Σ refers to the set of characters used for constructing sequences. Inthe biological setting (the one preferably addressed here), thesequences that we are dealing with are proteins and the set Σ is the setof the 20 amino acids. The terms protein/sequence will be usedinterchangeably below, and the same is true for the termscharacter/amino acid.

D refers to the underlying database of proteins, the one that thecollection of patterns (pattern dictionary or bio-dictionary) is builtupon. An example database that we will be using throughout thisdescription is the following (containing three sequences)

D = {s₁, s₂, s₃}, where s₁ = ARQSTLUMNPQ s₂ = FDSALQFTGMRA s₃ =RKMFPQDDSLA

Π refers to a collection of patterns, i.e., referred to herein as thebio-dictionary or the pattern dictionary 120. The exact method ofobtaining Π will be explained below in the section entitled “DictionaryFormation.” Patterns are regular expressions describing families ofpeptides. The polypeptide family represented by a single pattern isexpected to contain stretches of related (structurally, functionally,evolutionary) amino acids. More specifically, given the alphabet Σ ofamino acids, we define a pattern P in Π as a regular expression of theform:

Σ(Σ∪{‘. ’})*Σ,

where ‘.’ (Referred to as the “don't care character”) denotes a positionthat can be occupied by an arbitrary residue. Being a regularexpression, every pattern P defines a language of polypeptidesconsisting of all strings that can be obtained from P by substitutingeach don't care character by an arbitrary residue from Σ. Also, each Pin Π matches at least K_(min) sequences in D. K_(min) is an integer andits computation is explained below in the “Dictionary Formation”section. In the example below, we will assume a specified value. Theregions of the database sequences matching a pattern P are recorded inan offset list L_(D)(P) of the pattern P. This is a list containing allthe pairs (j, k) such that the pattern P matches the j-th sequence ofthe database at offset k.

For the example database introduced above, and assuming that K_(min)=2,the set of patterns is P={A.Q.T, M.PQ}. The two patterns in this setappear in the following input sequences (the matching positions areshown in bold-case):

A.Q.T M.PQ

S₁: ARQSTLUMNPQ s₁: ARQSTLUMNPQ

s₂: FDSALQFTGMRA S₃: RKMFPQDDSLA

The offset lists of the two patterns are as follows:

L_(D)(A.Q.T) = { (1, 1), (2, 4) } L_(D)(M.PQ) = { (1, 8), (3, 3) }

It is to be appreciated that the first term in each parenthetical is thesequence number and the second term is the offset. The offsetcorresponding to any character in a sequence is the distance of thatcharacter from the beginning of the sequence. For example, (2, 4)denotes that the sequence is S₂ and that the pattern A.Q.T begins at adistance of four characters from the beginning of sequence S_(2.)

Q refers to the query protein. An objective of the search engine of theinvention is to identify sequence homologies between the databasesequences in D and any query sequence Q that the user may supply. As anexample, we will use the query Q=JLANQFTLMDPQDLA. This sequence has anumber of homologous regions with the database sequences. Below, we areshowing some of them (again, pairs of similar regions are shown inbold-case):

Q: JLANQFTLMDPQDLA Q: JLANQFRLMDPQDLA

s1: ARQSTLUMNPQ s2:FDSALQFTGMRA

Thus, the search engine identifies similarities such as the ones shownabove. Two regions of equal length are similar if, when put one underthe other, they have several matching characters lined up together. Theexact notion of similarity will be made precise in what follows; fornow, it suffices to say that it involves the use of scores for everypossible pair of characters. Every such score is a measure of fitness,identifying how biologically probable it is for two characters to belined up together.

Given a pattern P, a “backbone” of P is defined as a string over thealphabet {1, 0} obtained from P by turning every residue of P into thecharacter ‘1’ and every don't care into the character ‘0’. For example,the backbone of the pattern P=“A.. DFE” is the string “100111”.Backbones partition the set of patterns into equivalent classes witheach class containing all the patterns sharing the same backbone.

Another concept that may be employed in accordance with the invention isthat of the “density” of a pattern. Generally, the density describes theminimum amount of homology between any two members of G(P) (where G(P)refers to the language of polypeptides consisting of all strings thatcan be obtained from P by substituting each don't care character by anarbitrary residue from Σ) and is defined by two integers L and W (L≦W):a pattern P has an <L, W> density if every substring of P that startsand ends with an amino acid and has length at least W contains L or moreresidues. In every such pattern, the ratio of the number of residuesover the length of the pattern is at least L/W. The integers L and W areparameters of our preferred method and their values control the amountof similarity allowed in the searches performed. These parameters arealso described in detail in a U.S. patent application identified by Ser.No. 09/023,756, filed on Feb. 13, 1998, directed toward the “TEIRESIAS”algorithm, which claims priority to a U.S. provisional applicationidentified by Serial No. 60/049,461, filed on Jun. 12, 1997, thedisclosures of which are incorporated herein by reference. Notice that,by definition, an <L, W> pattern has at least L residues.

Further, given a pattern P and a sequence S, any substring of S thatbelongs to G(P) is called a matching site for P. The offset list of Pcontains the offsets of the first character of all matching sites of P.

Given the above definitions, we can now provide a general description ofa preferred approach to improved sequence homology detection accordingto the invention, for example, in association with system 100 (FIG. 1).Sequence homology detection comprises two distinct phases: informationgathering and searching.

First, before any search is performed, the underlying database D ismined. This mining procedure is also referred to as informationgathering or dictionary formation. During this step all the significant<L, W> patterns are gathered and each such pattern P is associated withits offset list L_(D)(P) (the particular criterion used for deciding ifa pattern is significant or not is detailed in the search enginesection).

The second step is the actual search. Given a query sequence Q, weidentify all the patterns P (among those collected in the first phase ofthe process) which are matched by Q. For every such P, we pair togetherthe region(s) of Q which match P with the corresponding regions of allthe database sequences that also match P (these regions are easilyaccessible through the offset list L_(D)(P)). Finally, the pairedregions are extended and aligned in both directions and scored by theuse of a (user-defined) mutation matrix and the highest scoring matchesare reported along with the implied alignments.

It is worth pointing out here that the information gathering phase is anone-time computation over D. The results obtained are stored in a file(pattern dictionary 120 of FIG. 1) and used every time that a searchsession is performed over the database D.

A motivation behind using patterns for describing related polypeptideslies on biological facts. In particular, it is known that there is anumber of basic elements (either of structural nature like alphahelices, beta strands, loops, etc., or larger, functional units likemotifs, modules and domains) which are the building blocks that proteinsare made of. One of the key mechanisms used by evolution fordifferentiating among species is the mutation of amino acid positions ina protein sequence. Functionally/structurally important regions, though,are more resistant to such mutations. It is then reasonable to expectthat such biologically related polypeptides can be identified bydiscovering: (a) conserved positions in their primary structure; and (b)an increased degree of reusability. In our terminology, these propertiescorrespond to patterns with unexpectedly high support.

However, it is important to reiterate that the inventive search enginemethodology described herein may be utilized with other known patterndictionaries. Likewise, the inventive dictionary formation methodologymay be employed to create pattern dictionaries for use with other knownsearch engines.

It is to be appreciated that both methodologies are described below insections II and III, respectively. While the dictionary formation methodis applied before the search method, for the sake of facilitating theexplanation, we discuss the processes in reverse order, starting withthe search engine method and followed by the dictionary formationmethod.

II. Search Engine

Referring now to FIG. 4, a high level flow chart illustrating a searchengine methodology according to one embodiment of the invention isshown. This methodology may be employed by the search engine 110 in FIG.1. The operation of the search engine can be broken down into twodistinct phases: (i) pattern matching+chaining (block 402); and (ii)scoring (block 406).

The first phase checks every pattern P in Π (recall that Π refers to thepattern dictionary 120 mentioned above) against the query sequence Q,isolating all the patterns that match Q. Below we describe a specificalgorithm for performing this “check for a match” process; however, anymatching algorithm can be used. Notice the “Complexity Checking” (block404) of phase 1 in FIG. 4. In some cases, it is possible that a patternP matches the query Q and yet it is undesirable to take this match intoaccount. Such examples are the so called “low-complexity” patterns. Suchpatterns arise sometimes due to the nature of biological sequences.Low-complexity patterns are comprised almost exclusively of the sameamino acid, e.g., the pattern “A.A..AAA.A.A” and appear because someproteins have long regions of repeating amino acids. Such patterns,though, are not considered important for homology detection purposes andit may be better to ignore any matches induced by these patterns. Thedecision to do so or not is left to the system users, by allowing themto set a “complexity-checking” component within the search engine to its“ON” or “OFF” state. For now, it is enough to remember that in the casewhen this component is set to “ON,” some patterns in P will be ignoredeven though they match the query protein Q. Below we will give adescription of the precise conditions under which a pattern P matching Qis ignored when the complexity checking is “ON.”

Continuing with phase 1, every pattern P that matches Q generates alocal homology between Q and all the database regions also matching P.Those latter regions are easily accessible through the offset listL_(D)(P) of P. Assuming that P matches Q at offset i, every region (j,k) in L_(D)(P) gives rise to the segment (i, j, k, l), where l is thelength of the pattern P. This is explained in detail below. Finally, asthe matching process goes on, compatible segments get chained together,forming longer segments (the notion of compatible segments as well asthe operation of chaining will be explained below). At the end of phase1, we are left with a set R containing all the sequences of the databaseD that match at least one pattern P in Π such that P also matches Q.Each sequence S in R is accompanied by the segments that describe thepattern-induced homologies between Q and S.

Consider the example that we introduced above. The query sequenceQ=JLANQFTLMDPQDLA, matches both the patterns P1=“A.Q.T” and P2=“M.PQ” inΠ. Given that P1 matches Q at offset 3 and P2 matches Q at offset 9,these 2 matches give rise to the following 4 segments:

(3, 1, 1, 5) (3, 2, 4, 5) —from L_(D)(P1)

(9, 1, 8, 4) (9, 3, 3, 4) —from L_(D)(P2)

and the set R is:

R={s ₁—(3, 1, 1, 5) (9, 1, 8, 4)

s ₂—(3, 2, 4, 5)

s ₃—(9, 3, 3, 4)}

where each sequence s₁ in R carries along a list of segments. Noticethat in this particular example no chaining was possible.

The second of the phases of the search engine methodology depicted inFIG. 4 assigns a score to every sequence S in R. There are a number ofapproaches for computing this score for a given S_(j). Each one, though,starts by scoring all the segments carried along by S_(j). Each segmentreceives a score (these scores are called segment scores). The scoringis performed based on a mutation matrix M. Mutation matrices are 20×20arrays of reals. The (i,j)-th entry of such a matrix indicates theprobability that the i-th amino acid has changed during evolution to thej-th amino acid. For our purposes here, it suffices to assume that M isa function from Σ×Σ—>R which, when given two amino acids A1, A2 asinput, returns a real number. Since there are many mutation matricesthat may be used, the user is given the option of choosing theparticular matrix M to use.

For example, assume that we are using the unary mutation matrix M, i.e.,M(A, A)=1 for all amino acids A, and M(A,B)=0 for all distinct aminoacids A, B. Consider the first sequence in the set R above, namely thesequence s₁ which carries along the segments (3, 1, 1, 5) and (9, 1, 8,4). We show how to score the first of these two segments (the other one,as well as all segments in the set R are scored similarly). Imagine thatwe have aligned (one under the other) the two protein regions of length5 implied by the segment, i.e., the region starting at offset 3 of Q andat offset 1 of s₁:

ANQFTL (from Q)

ARQSTL (from s₁)

The score of the segment is then computed by summing over all thealigned columns the values M(X,Y) where X, Y are the two amino acidsaligned together under any given column. For the segment above, thisscore is:

M(A, A)+M(N, R)+M(Q, Q)+M(F, S)+M(T, T)+M(L, L)=1+0+1+0+1+1=4.

The scoring scheme described above for the segments is a basic scoringscheme. That is, the system user can set a number of options to modifythe way in which the segment score is computed. For example, if thesystem parameter extend (which is an integer and is described below) hasbeen set to a value larger that 0, the scoring takes into account notonly protein regions described by the segment, but also an area ofextend amino acids left and right of the two regions (the scoringproceeds exactly as described above, only that now longer regions areconsidered). Furthermore, if a gapped_alignment option has been set,then in the alignment of the extended regions (i.e., those regions leftand right of the basic segment) we also use gaps in order to maximizethe alignment score.

At the end of the above process (independent of which scoring variant isused), a segment score will have been computed for every segment in theset R. These segment scores are then utilized for the final step of thescoring phase, namely, quantifying the amount of similarity between Qand all the sequences S_(j) in R. This quantification is performed byassigning a score to every S_(j) in R; this score is called the sequencescore for S (in order to distinguish it from the segment scores).Ideally, the higher the sequence score is for a sequence S_(j), the moresimilar this S_(j) should be to Q.

In scoring S_(j), we only consider the segment scores of the segmentsthat S_(j) carries along. There are, again, several options. In thesimplest case, the sequence score for S_(j) is defined as the maximumamong all the segment scores for the segments of S_(j). A second, moreinvolved approach is described below. Here, a directed graph is firstbuilt for the sequence S_(j) being scored. The vertices of this graphare all the segments that S_(j) carries along. Every vertex is assignedthe segment score of the segment corresponding to that vertex. An edgeis placed from the segment (i, j, k, l) to the segment (i′, j, k′, l′)if

i<=i′ and k<=k′,

i.e., if the relative order of the two query regions described by thetwo segments (the regions Q[i..i+l−1] and Q[i′..i′+l′−1]) is the samewith the relative order of the two regions on S_(j) described by the twosegments (the regions S_(j)[k..k+l−1] and S_(j)[k′..k′+l′−1]). As withthe vertices, every edge is also assigned a score, representing howregular is the displacement of the regions in the query (i.e., thedifference i′−i) relative to the displacement of the regions on S_(j)(i.e., the difference k′−k). The larger the difference between thedisplacements (i.e., the number |(i′−i)−(k′−k) |), the smaller the scoreof the edge. After the graph has been built, we can apply any standardlongest path algorithm for identifying the path with the highest score(the score of a path is defined as the sum of all vertex and edge scoresin the path). This score then becomes the sequence score for S_(j).

Above, we have described several ways of computing both the segment andthe sequence scores. In general, any other “biologically reasonable”scoring scheme can be used in their place.

Referring now to FIGS. 5, 6 and 7, more specific examples of the patternmatching, chaining and scoring processes of the search enginemethodology 400 will now be explained. Again, during the search phaseimplemented by search engine 110, query proteins Q are provided to thesystem and database sequences Sε D similar to Q are identified andreported back to the user. The searching phase utilizes a set Π ofpatterns obtained by mining the input database D. For the purposes ofthe example here it is sufficient to assume that Π is a set of <L, W>patterns of the form described in the “Definitions” section above. Eachpattern Pε Π is accompanied by its offset list L_(D)(P) and has supportat least K_(min) in D. The numbers L, W and K_(min) are parameters ofour preferred method and the way in which they are set will be describedbelow in the “Dictionary Formation” section.

The first thing to do when a query sequence Q is provided to the systemis to locate all Pε Π that are matched by Q. This can be done very fastby using a hashing variation of a technique presented in D. Gusfield,“Algorithms on strings, trees and sequences: Computer Science andComputational Biology”, Cambridge University Press, 62-63, 1997. Morespecifically, for every position within Q we generate W hash values, onefor every substring of length 2, 3, . . . , (W+1) starting at thatposition. For every such substring, the corresponding hash value dependsonly on the first and last characters of the substring, as well as onthe number of residues in between these two characters.

FIG. 5 provides an example of the process for a given query sequence. Inthe example, the hash values generated for the W=4 substrings startingat position 6 of the sequence Q are shown. The hash value used for asubstring s is:

H(s)=((av(first _(—) char)−av(‘A’))+(av(last _(—)char)−av(‘A’))*26)*W+gap

where av(c) is the ASCII value of the character c, first_char andlast_char are the first and last characters of s respectively and gap isthe number of residues in between the first and last characters of s.Notice that because of the <L, W> density restriction, the gap is alwaysless than W.

The hash entry corresponding to a particular value h contains all theoffsets p of the query sequence Q such that a substring (of length atmost W+1) starting at p hashes to the value h. FIG. 6 gives an exampleof the hash table generated for a particular query sequence. In FIG. 6,a snapshot of the hash table generated for the sequenceQ=AFGHIKLPNMKAMGH is shown. Instead of using actual numeric hash valuesin order to label the table entries, we use a pattern describing all thestrings that hash to a particular hash value. Each hash entry points toa list of offsets. Every offset in that list marks the beginning of asubstring in Q which hashes to the relevant hash entry.

In order to check if a pattern P ε Π is matched by Q we use an array ofcounters C[|1..|Q|] of size equal to the length of Q. Initially, everyentry of the array is set to 0. Starting at offset 1 in P, we locate alloffsets j within P corresponding to a residue, excluding the offsetcorresponding to the last residue. For every such j, let F be theshortest substring of P starting at j and containing exactly tworesidues. Let OL denote the list of offsets in Q pointed to by the hashtable entry corresponding to F. If OL is not empty, then for everyoffset p ε OL, the counter C[p−j+1] is incremented by one. If thepattern P contains exactly n residues, then at the end of this processthe counter C[i] will have the value (n−1) if and only if Q matches P atoffset i. An advantage of the matching technique described above is thatit typically requires time which is sublinear to the size of the querysequence Q and only depends on the number of residues in the pattern P.

Once a pattern Pε Π is found to be matched by a substring of Q startingat offset i, we need to relate that substring of Q with all the database regions also matching P. This is easily done by scanning the offsetlist L_(D)(P) which contains exactly these regions. More specifically,each entry (j, k)εL_(D)(P) indicates that the substring starting atoffset k of the j-th database sequence S_(j) is an element of G(P). Thelocal similarity between the query sequence Q and the database sequenceS_(j) is then registered as a quadruplet (i, j, k, l), called a segment,which gets associated with S_(j). The number 1=|P| is the length of thelocal similarity.

Sometimes, two distinct patterns P and P′ matching both Q and a databasesequence S_(j,) correspond to the same local similarity between Q andS_(j). An example of such a situation is depicted in FIG. 7. In suchcases, the individual segments corresponding to the two patterns must bechained into one. In particular, two segments (i,j, k, l) and (i′, j,k′, l′) associated with S_(j) are called compatible if and only if:

k<=k′ and k+1+w _(—) len>k′ and k′−k=i′−i

where w_len is an integer parameter defined by the user; w_len allowsfor the chaining of segments which are not intersecting as long as onestarts no more than w_len positions after the end of the other. Thesegment resulting from chaining (i, j, k, l) and (i′, j, k′, l′)together is:

(i, j, k, max(l, k′−k+l′))

Chaining of compatible segments takes place every time that a newsegment gets associated with a database sequence S_(j), as the result oflocating a pattern Pε Π matched by both Q and S_(j). If there aresegments already associated with S_(j) which are compatible with thenewly arriving segment, then the relevant pair of the new and theexisting segment is discarded and gets replaced by the outcome of theirchaining.

Having identified all the local similarities between Q and the databasesequences, we are left with the task of evaluating these similarities.This is done by assigning a score (using a user defined scoring matrix)to every database sequence S_(j) that is associated with at least onesegment. Several options are available for the scoring function. One ofordinary skill in the art will appreciate other ways to score given theinventive teachings herein.

As mentioned above, one approach is to score each segment of S_(j)individually and assign to S_(j) the highest of these scores. Scoring asegment (i, j, k, l) can be done in either of two ways:

no gaps allowed: in this case the score is computed from the ungappedalignment implied by the segment, namely, the alignment of the regionsQ[i, i+l−1] of the query and S_(j)[k, k+l−1] of the sequence.Furthermore, the user is given the option to extend the alignment“around” the segment by setting the variable extend. If this value isgreater than 0, then the score is computed from the ungapped alignmentof the regions Q[i-extend, i+l−1+extend] and S_(j)[k-extend,k+l−1+extend].

allowing gaps: this option is available only when extend >0 and permitsfor a finer scoring of the area around the segment, by allowing for gapsin that area of the alignment.

As mentioned above, other scoring options are also offered, taking intoaccount the relative order of the segments associated with the databasesequence S_(j) currently being scored. One approach after scoring eachsegment individually as described above, is to build a directed,weighted graph, as shown in FIG. 8. The vertices V of this graph are thesegments associated with S_(j) and there is a directed line between thesegments (i, j,k, l) and (i′, j, k′, l′) if:

i<=i′ and k<=k′.

Every vertex is assigned a weight equal to the score of thecorresponding segment, while every edge E is weighted based on: (a) howclose the two segments are, i.e., the value of (i′−i−1); and (b) howregular is the displacement among the two segments, i.e., how muchdifferent (i′−i) is from (k′−k). The score of a path within this graphis the sum of the weights of all the vertices and edges of the path. Thepath with the maximal score is then computed and that score is assignedto S_(j).

Referring now to FIGS. 9 and 10, respective flow charts summarilyillustrate embodiments of the two phases performed by a search enginemodule of the invention. FIG. 9 depicts an embodiment 900 of thematching and chaining phase, while FIG. 10 depicts an embodiment 1000 ofthe scoring phase.

In FIG. 9, it is assumed that every sequence S_(j) in the database D hasan associated list of segments SegL(S_(j)). Initially, all these listsare empty. Also, the set R is initially empty. As the computationdescribed by the flow chart of FIG. 9 proceeds, R is populated bysequences S_(j). As such a sequence gets inserted into R, it bringsalong its segment list SegL(S_(j)).

Thus, for every pattern P in Π (block 902), the search engine performsdoes the following operations. In step 904, the search engine determineswhether P matches Q. If no, then go to the next P in the dictionary. Ifyes, in step 906, the search engine determines whether the complexitychecking component has been enabled by the user. If it has been enabled,in step 908, the engine determines is the match of P to Q is a lowcomplexity match (to be explained in greater detail below). If yes, thenthe engine goes to the next P in the dictionary. If no, then, for alloffsets i where P matches Q (bock 910) and for all (j, k) in L_(D)(P)(block 912), the engines performs the following operations. In step 914,it chains the segment (i, j, k, |P|) with any compatible segment inSegL(S_(j)). Then, the engine adds the result in SegL(S_(j)).

In step 916, the engine determines whether S_(j) is in R. If yes, thenthe engine returns to step 914. If no, then the engine adds S_(j) andSegL(S_(j)) in R. Steps 914 through 916 are performed for all offsets iwhere P matches Q and for all (j, k) in L_(D)(P). The entire process(steps 904 through 916) is repeated for every P in the patterndictionary.

Now that matching and chaining has been performed, the search engine thescoring operations in FIG. 10. So, for all sequences S_(j) in R (block1002) and for all segments s in S_(j) (block 1004), the engine computesa segment score for s, in step 1006. Then, for all sequences S_(j) in R(block 1008), the engine computes a sequence score for S_(j), in step1010. Lastly, in step 1012, the engine reports the highest scoring S_(j)in R along with the local alignments implied by their respectivesequence scores.

Referring again to FIG. 4 and as previously mentioned, the search enginemodule 110 may include a complexity checking component (e.g., step 906of FIG. 9). The complexity checking component is responsible fordiscarding local homologies generated because of low complexity regions.First, the low complexity checking happens in two phases: both duringthe dictionary building phase (“Dictionary Formation” section) as wellas in the Searching phase (this section).

During the dictionary building phase, low complexity regions are dealtwith in two ways. First, when looking for patterns in the inputdatabase, we disregard (i.e., remove from the input) all protein regionsthat constitute of L or more consecutive appearances of the same aminoacid (L is an integer parameter that we set during the dictionarybuilding phase; for our purposes here it suffices to assume that it hassome fixed value). This takes care of low complexity regions like theone shown in boldface below (the dots indicate that there are more aminoacids left and right of the depicted string):

. . . . . . ASDFHRTYIUSFFFFFFFFFFFFFFFFFFAKJRBVCJ . . . . .

This is, however, just one case of a low complexity region. Many moreexist. Consider, for example, the bold-faced part of the followingregion:

GFWRETIOJIFPAPAPAPAPAPAPAPAPAPAPAPAJSHDGF . . . .

To deal with regions of that sort (i.e., of a generalized repetitivecomposition), we also discard all overlapping appearances of a givenpattern P. In other words, if the pattern P matches the databasesequence S_(j) at the offsets k₁ and k₂ (where k₂>k₁) and k₂−k₁ is lessthan the length of P, then neither offset is placed in the offset listL_(D)(P) of the pattern P. For example, in the region shown above, thepattern “P.P.PA” which has a length of 6, appears (among other places)at offsets 12 and 14, i.e., at overlapping positions, since 14−12=2 and2<6.

During the search engine phase now, we have two ways of capturing anddiscarding low complexity homologies. The first one is a generalizationof the example given above. In short, we would like to discard allpatterns that are not “linguistically rich,” i.e., they exhibit anover-representation of one specific amino acid. For that purpose, weallow the user to set the value of a parameter V (a real between 0 and1). A pattern P matching the query sequence Q will be further consideredonly if a variability v(P) of P is no more than the value V. Inparticular, for each pattern P, we define its variability v(P) as:${v(p)} = \frac{\max\limits_{R}\quad \left\{ {{number}\quad {of}\quad {times}\quad {that}\quad {the}\quad {residue}\quad R\quad {appears}\quad {in}\quad P} \right\}}{{total}\quad {number}\quad {of}\quad {positions}\quad {in}\quad P\quad {covered}\quad {by}\quad {residues}}$

Even after passing the variability test described above, a second levelof checking follows. This second level intends to capture a more subtlenotion of low complexity. To understand how it works, consider thefollowing example. Let us assume that the query protein Q is thefollowing simple string:

Q =FRGDSAAABBBBAABBSJIEKL

and let us consider the pattern P=“A...B..AB”. The pattern matches thequery at offset 7, as shown below:

A . . . B. . AB

FRGDSAAABBBBAABBSJIEKL

The region of the match and its immediate surrounding is a lowcomplexity region (it is comprised of just ‘A’s and ‘B’s). The patternP, however, has a variability of just 0.5. To deal with low complexityregions of this character, we allow the user to define integers marginand min_m (where min_m<=2*margin) as well as a percentage perc. We thencheck for approximate matches of the pattern under consideration (herethe pattern “A...B..AB”) in margin characters left and margin charactersright of the actual matching site (here the offset 7 of the query). Apattern P matches approximately at a given offset of the query if, whenplaced at that offset, at least perc % of the regular characters in thepattern match the underlying characters of the query. For example, ifperc =75%, then the pattern “A...B..AB” approximately matches Q atoffsets 6 and 8, as shown below:

A . . . B. . AB

FRGDSAAABBBBAABBSJIEKL (at offset 6)

A . . . B . . AB

FRGDSAAABBBBAABBSJIEKL (at offset 8)

since, in each of these offsets, 75% of the pattern regular characters(i.e., 3 out of the 4) match the corresponding query characters. Havingdefined the parameters margin, min_m and perc, we are now ready to saywhen a pattern induced local homology between the query and a databaseregion is deemed of low complexity during this level of checking.Consider that a pattern P matched the query Q at offset X and a databasesequence S at offset Y. The matching will be considered of lowcomplexity if either of the following is true: (i) the pattern matchesthe query Q approximately in at least min_m of the 2*margin charactersleft and right of X.; or (ii) the pattern matches the sequence Sapproximately in at least min_m of the 2*margin characters left andright of Y.

III. Dictionary Formation

As previously mentioned, in a preferred embodiment, the dictionaryformation methodology is performed prior to a the search enginereceiving a query sequence from a user. This is because, referring againto FIG. 1, the search engine module 110 preferably utilizes the patterndictionary 120 formed by the dictionary formation module 130. Thedictionary formation module 130 implements the inventive databaseprocessing methodology that is explained below to form the patterndictionary (or bio-dictionary). However, also as previously mentioned,the pattern dictionary 120 may be used by search engines other than theone described herein. That is, existing search engines may utilize thepatterns mined from a source database in accordance with the invention.Nonetheless, in accordance with a preferred embodiment, it will beassumed that the pattern dictionary formed according to the inventivemethodology described herein will be used by the inventive search enginealso described herein.

During the dictionary formation phase (also referred to as aninformation gathering phase), the set Π of all the significant <L, W>patterns found in the database D under consideration is determined. Thisis, in essence, a data mining procedure in which D is exploited with theintention to discover hidden relationships among the sequences of D. Theidea is to focus on those relationships which are considered unexpectedand by virtue ofthat quality they are also presumably of biologicalrelevance. For our purposes, the significance of a pattern will bedescribed by its support within D. More specifically, we will seek todefine the number K_(min) (the minimum support) such that every patternwith support at least K_(min) can be shown to be statisticallyimportant. All such patterns (along with a few exceptions which do notabide by the minimum support requirement) will be included in the set Π,the input to the search phase.

Recall that the concept of K_(min) was first introduced above in the“Definitions” section. Also, the concept of “density” was alsointroduced. Recall, that density describes the minimum amount ofhomology between any two members of G(P) (where G(P) refers to thelanguage of polypeptides consisting of all strings that can be obtainedfrom P by substituting each don't care character by an arbitrary residuefrom Σ) and is defined by two integers L and W (L≦W): a pattern P has an<L, W> density if every substring of P that starts and ends with anamino acid and has length at least W contains L or more residues. Again,these parameters are described in the above-incorporated U.S. patentapplication identified by Ser. No. 09/023,756, filed on Feb. 13, 1998,directed toward the “TEIRESIAS” algorithm.

While a preferred methodology of the invention utilizes parameters L andW in forming the pattern dictionary Π, it is to be appreciated thatother known techniques for determining the minimum amount of homologybetween any two members of a group of sequences may be employed.

Setting the values of parameters L, W and K_(min) involves theconsideration of a number of sometimes conflicting and interconnectedfactors. The ratio L/W, for example, describes the amount of homologyallowed, during the search phase, between a query sequence and theproteins in D. A small L/W will permit the detection of weaksimilarities. Since several value pairs (L, W) lead to the same ratioL/W, what should the exact settings for L and W be? Opting for a largevalue of L will usually result in a long running time for theinformation gathering phase (unless L/W is close to 1). Furthermore,selecting a large L would ignore weak patterns with only a few aminoacids, which are among the ones that are of interest (i.e., are usuallymissed by existing similarity searching tools). Selecting too small an Lon the other hand (e.g., 2 or 3) may be useless since in that case thedistribution of <L, W> patterns with L+i residues (for small i) in theinput database D is not significantly different from the correspondingdistribution in a random database with the amino acid composition of D.In the most general case, it is to be appreciated that the values L, Wand K_(min) can be chosen completely arbitrarily. However, in order tosubstantially guarantee that the discovered patterns are well above thelevel of statistical noise, we augment the pattern discovery processwith a statistical framework (i.e., a way to set the parametersmentioned above).

To make the above point more clear, consider FIG. 11 which compares thedistribution of patterns in a test database known as SwissProt Rel. 34or SP34 (see A. Bairoch and R. Apweiler, “The SWISS-PROT proteinsequence data bank and its supplement TrEMBL in 1998”, Nucleic AcidsRes., 26:38-42, 1998) with the corresponding random distributions. FIG.11 depicts a distribution of patterns with given backbone structures inSP34 (the distributions being denoted by the “o” symbols) and comparisonwith the random distribution (the distributions being denoted by the “+”symbols) of the same backbones. Recall that the concept of a backbonewas first introduced above in the “Definitions” section. A point (X, Y)in a curve indicates that there are exactly Y patterns (of the givenbackbone structure) such that each of these patterns has support X,i.e., it is matched by exactly X distinct database sequences. Theresults shown here were obtained using a “cleaned-up” version of SP34(cleaning up of a database is explained below). For SwissProt, wecomputed the support of each <L, W> pattern with exactly L residues (forthe values of L, W shown in FIG. 11). Then, the results were tabulatedcreating one row for each possible backbone: the i-th column of the rowcorresponding to a given backbone B indicates the number of patterns (ofthat backbone structure) with support i within SwissProt. The randomdistributions were obtained by following exactly the same approach forN=2000 randomly shuffled versions of SwissProt (FIG. 13 describes theshuffling process by which each one of the shuffled versions isobtained). In this case, the row for a given backbone B is obtained byaveraging the rows corresponding to B in all the 2000 tables. As aresult, the i-th column gives a sufficiently accurate estimate of themean number of patterns with backbone B that appear in exactly isequences within a random data base having the residue composition ofSwissProt. In FIG. 11, we plot the SwissProt results of selectedbackbones against the distribution of the means for the same backbones.Although the results presented involve particular backbones, there is noqualitative change if other backbones are used.

Notice that we are using 2000 sampling points (the randomly shuffledversions of the input database). This is just for illustrative purposes.The actual number of sampling points can, in principle, be setarbitrarily. In general, as the number of such points becomes larger,the estimates that we obtain converge more accurately to their truevalues. Given any desired confidence level for the estimations to becomputed, standard statistics theory can be used to decide how manysampling points to use.

As FIG. 11 implies, we start distinguishing the compositional bias (interms of patterns) in SwissProt versus a random database only when Lbecomes 5 or larger. In general, the value of L will depend on the sizeof the underlying database D: the larger the database, the higher thisvalue should be. The results shown for SwissProt have been obtainedusing the value L=6. For W, we chose the value 15, so that the ratio L/W(i.e., the minimum allowed homology) is 40%.

Having set the values of L and W it remains to decide what the minimumsupport K_(min) should be. We focus only on patterns with exactly Lresidues since every larger pattern contains at least one subpatternwith exactly that many amino acids. One approach is to select K_(min) sothat the probability of a pattern appearing in K_(min) or more distinctsequences is small. A closer look at FIG. 11(d), though, reveals thatthis approach might be too strict. In particular, consider a supportlevel of K=15. The random distribution indicates that one expects, bychance alone, between one and two patterns with support K. So, accordingto the aforementioned criterion, a pattern with support 15 withinSwissProt would be deemed not important. However, the two distributionshave a striking difference at that support level. In particular, whilethe mean of the random distribution at K=15 has a value of about 1.5,within SwissProt there are about 180 patterns with support 15.

So, it seems that if one considers the probability of a pattern inisolation the result will be to discard many patterns which, accordingto the above distribution, are above the level of noise. Thisobservation prompts us to use a different criterion for significance.

Referring now to FIGS. 12 through 15, we present flow diagramsillustrating a preferred approach for determining a significancecriteria. That is, we provide a methodology for computing the value ofK_(min). Given the value of K_(min), the pattern dictionary Π is formedby including therein all patterns in the source database D that have atleast that value K_(min) as support. Thus, it is to be understood thatthe dictionary formation module 130 of FIG. 1 may perform the processesdepicted in FIGS. 12 through 15.

Generally, in our approach, instead of looking at individual patterns,we consider together all the patterns of a particular backbonestructure. More specifically, for any given backbone B and an underlyingdatabase D, let N_(B,K) be:

N_(B,K)=number of patterns with backbone B which have support K withinD. Also, let X_(B,K) be the random variable (defined over the space ofall shuffled versions of D) corresponding to N_(B,K). The minimumsupport K_(min) is then the first number K for which the followinginequality is true:${\max\limits_{B}\left\{ {\Pr \left\lbrack {X_{B,K} \geq N_{B,K}} \right\rbrack} \right\}} \leq {threshold}$

where threshold is a user defined probability imposing a level ofconfidence on the minimum support K_(min) coming out of the aboveinequality. A smaller threshold leads to a larger value for K_(min) andto a greater statistical importance for the patterns that will beeventually selected.

Thus, as inputs to the process for determining the significance criteriaK_(min), we have the source database D, the integer parameters L and W,an integer N representing a number of samples, and threshold which is areal number between 0 and 1. Of course, as the output of the process, weget the integer K_(min), such that all patterns in D that have supportK_(min) or more are statistically important and therefore are includedin the pattern dictionary to be searched upon receipt of a user query.

The following explanations of the flow charts uses various notation,some of which has been introduced above. However, for the sake ofclarity, the following definitions apply. Given any pattern P, thebackbone B(P) of P is defined as the string over {1, 0} obtained whenreplacing every regular character of P with ‘1’ and every don't carecharacter of P with a ‘0,’ e.g., if P=A..F.G..R, then B(P)=100101001. IfB is an arbitrary backbone and P is a pattern such that B(P)=B, then wesay that P is a B-pattern. N_(B,K) is then said to be the number ofB-patterns with support K in D, X^(j) _(B,K) is the number of B-patternswith support K in the i-th random database. While m_(B,K) is the averageof all X^(j) _(B,K) and s_(B,K) is the variance of all X^(j) _(B,K). Itis to be appreciated that since we do not have an analytical descriptionfor the distribution of the random variable X_(B,K), we employ standardsampling techniques. Thus, for a given database D, we are able toexperimentally compute accurate point estimates for both the average(mean) m_(B,K) and the variance (deviation) s_(B,K) of the randomvariable X_(B,K).

Referring initially to FIG. 12, the overall process 1200 starts byrunning the TEIRESIAS algorithm (i.e., as described above and in theabove-incorporated U.S. patent application identified by Ser. No.09/023,756, filed on Feb. 13, 1998) on D and computing N_(B,K), in step1202. While the TEIRESIAS algorithm is preferred, it is to be understoodthat N_(B,K) may be computed using other conventional techniques. Thenfor i=1 to N (block 1204), the following steps are performed.

In step 1206, a random database R_D_(i) is generated. This step isfurther explained in the context of FIG. 13. As shown in process 1300,R_D_(i) (block 1302) is generated by computing, for each sequence S in D(block 1304), a random permutation of the characters in S (step 1306).The random permutation of the characters in S is referred to as S′. S′is added to R_D_(i) (step 1308). The process is repeated until everysequence S in D is processed (block 1310). Thus, R_D_(i) includes allrandom permutations S′. Returning to FIG. 12, in step 1208, TEIRESIAS isrun on R_D_(i) in order to compute X^(i) _(B,K). Steps 1206 and 1208 arefor all i's, that is, until i=N (block 1210).

Then, for every B, K (block 1212), we use the X^(i) _(B,K) for computingm_(B,K) and s_(B,K). This step is further explained in the context ofFIG. 14. As shown in process 1400, s_(B,K) is first set to 0 (step1402). Then, for i=1 to N (block 1404), s_(B,K) is computed as the sumof s_(B,K) and X^(i) _(B,K) (step 1406). The process is repeated for alli's (block 1408) and the mean s_(B,K) is finally computed by dividings_(B,K) by N (step 1410). Then, the deviation m_(B,K) is computed insteps 1412 through 1420. First, in step 1412, m_(B,K) is first set to 0.Then, for i=1 to N (block 1414), m_(B,K) is computed as the sum ofm_(B,K) and (X^(i) _(B,K)−s_(B,K))²(step 1416). The process is repeatedfor all i's (block 1418) and the deviation m_(B,K) is finally computedby dividing m_(B,K) by N (step 1410).

Returning to FIG. 12, in step 1216, P_(B,K) is now computed usingm_(B,K) and s_(B,K) This step is further explained in the context ofFIG. 15. As shown in process 1500, a real number C is defined in step1502, such that:$N_{B,K} = {\left( {m_{B,K} + {1.96\frac{s_{B,K}}{\sqrt{N}\left( {1 + \sqrt{\frac{1.96}{\sqrt{2\quad N}}}} \right)}}} \right) + {C\left( \frac{s_{B,K}}{1 + \sqrt{\frac{1.96}{\sqrt{2\quad N}}}} \right)}}$

where N represents a particular number of samples or trials, e.g., 2000.Thus, in step 1504, P_(B,K) is computed as being equal to$\frac{1}{C^{2}}.$

It is to be understood that P_(B,K) is an upper bound for theprobability Pr[X_(B,K)>N_(B,K)]. Thus, in summary, we use the samplemean and deviation of X_(B,K) to compute C for the values of N_(B,K) athand. It is to be appreciated that the constant C is associated withChebychev's inequality, as is well known in the art of statistics. Notethat the constant C is calculated using a confidence level of 95%,however, this is not a requirement. That is, any other value would beapplicable as well.

Returning to FIG. 12, steps 1214 (FIG. 14) and 1216 (FIG. 15) arerepeated for every B,K. Then, in step 1220, K_(min) is determined to bethe smallest K such that max_(B){P_(B,K)}≦threshold. In the test casepresented in the next section (SwissProt. Rel. 34), the value ofthreshold has been chosen so that K_(min)=15, i.e., the support levelwhere only 1.5 patterns of a given backbone structure are expected bychance. There is a tradeoff: we are willing to allow a small number ofpattern-induced local homologies which can be the result of chance (the1.5 patterns above) in order to be able to capture the many morestatistically important similarities implied by the other patterns atthat same support level present within SwissProt.

Before providing some experimental results in the next section, we firstexplain the concept of cleaning up a database before performing thedictionary formation methodology of the invention. This process isdepicted in FIG. 16 and also may be implemented by the dictionaryformation module 130 of FIG. 1. Several databases contain groups ofhighly homologous sequences (e.g., the hemoglobin alpha chain proteins).Such groups not only slow down the pattern discovery process byintroducing a huge number of patterns, but they can also spuriouslyelevate the significance of a pattern. This happens for patterns thatappear many times within a family of highly homologous sequences andonly occasionally outside of it.

In order to deal with these problems, a database D may be “cleaned up”before the pattern discovery process begins. As shown in FIG. 16, thecleaning up process 1600 involves identifying and grouping togetherhighly similar proteins (step 1602). Two sequences are placed in thesame group if after being optimally aligned the shorter one has X% ofits positions (e.g., 50%) identical to that of the longer sequence. Theresulting groups are called redundant groups. The set D′ on which theinformation gathering process will be performed is comprised of: (a)those sequences in D which were not found to be sufficiently homologousto other proteins; and (b) the longest sequence from each of theredundant groups (step 1604). Finally, each of the redundant groups isseparately processed by the TEIRESIAS algorithm (step 1606), collectingpatterns until all the sequences of the group match at least one ofthese patterns. This approach guarantees that even groups containingmulti-domain proteins are treated correctly, by generating at least onepattern per domain. It is worth pointing out that patterns resultingfrom the processing of the redundant groups will usually be quite dense(the number of residues is going to be much larger than the number ofdon't care characters) and long. This is a consequence of the highhomology of the group sequences. For such patterns, we allow approximatematches during the search phase.

IV. Experimental Results

In this section we discuss experimental results associated with apreferred embodiment of the invention. That is, the following resultshave been generated by implementing both the dictionary formation(information gathering) and search engine methodologies of theinvention, explained in detail above, in accordance with SwissProt Rel.34 as the test database. A quantitative and qualitative description ofthe patterns discovered in the information gathering phase is given inthe first subsection (A) below by analyzing the coverage that thesepatterns achieve for SwissProt and by annotating the most frequentlyoccurring of them. In a second subsection (B) below, we present theresults of the search phase on a number of query sequences.

A. Information Gathering

The treatment of SwissProt starts by cleaning it up as described in theprevious section. The results of this process are detailed in FIG. 17.The clean-up process on SwissProt generates 9,165 redundant groups ofhighly similar sequences. The cleaned-up database (the one that theinformation gathering phase will operate on) is formed by removing thehighly-similar sequences from the original input and then augmenting theresulting set by adding in it the longest sequence from each redundantgroup.

Having the cleaned up database available, all that is needed forTEIRESIAS to work on it is setting the values of the parameters L, W andK_(min). As already explained, we use the settings L=6 and W=15.Further, in the results reported here we chose a threshold value of10⁻¹¹ and a confidence level of 95% in the computation of thedeviations. The value of K_(min) computed for these settings turned outto be 15. Running TEIRESIAS on the cleaned-up database with the valuesof L, W and K_(min) specified above generated a set Π (patterndictionary) of 534,185 patterns.

Mining the cleaned-up database is only the first step of the informationgathering phase. It is also necessary to apply the pattern discoveryprocess on the 9,165 redundant groups. Again, we use TEIRESIAS to treateach such group collecting enough <6, 15> patterns to make sure thateach sequence in the group matches at least one pattern. These patternsare then added to the set Π in order to form the final set of patterns Πwhich will be used by the search phase. FIG. 18 provides informationregarding the coverage achieved by these patterns over the entireSwissProt Rel. 34. The database regions covered by a pattern are exactlythose substrings matching the pattern. Notice that for dense and longpatterns (coming mostly from the processing of the redundant groups), wehave allowed for approximate matches, where “most” of the pattern(specifically, 80% of the patterns' residues) is matched by a region. Itis worth pointing out that most of the uncovered sequences arefragments. More specifically, only 231 have size more than 50. FIG. 19gives distributions for the following characteristics of the patterns inΠ: (i) length of the SwissProt Rel. 34 patterns; and (ii) number ofamino acids or residues.

As exemplified in FIG. 18, one of the key goals for the success of thesearch phase to follow (namely the good coverage of SwissProt) has beenachieved. The question that remains to be answered is if the patternsdiscovered are of biological relevance. In an effort to address thisconcern, we analyzed the most frequently occurring among these patterns.The resulting annotation is presented in FIG. 20. From this analysis, itis evident (at least for the patterns which were examined) that thepattern discovery process identifies sequence features that arebiologically important.

FIG. 20 illustrates the 100 patterns with the highest support. Whereverpossible, the patterns within a category were aligned with respect toone another. The lower case italics were used for convenience and areplace-holders for the following bracketed expressions: a: [STGDAR], b:[STGDK], c: [STGDKY], d: [STGK], e: [GASMDL],f: [GISETV], g: [LIVMFY],h: [LIVMF], i: [LIVMA], j: [LIVMC], k: [LIVMF], l: [ILVMF], m: [QKCS],n: [KRQA], o: [IVTNF], p: [QKCASN], q: [QKIAGN], r: [RKAHQN], s:[KRQNE], t: [KRQMN], u: [LFYIMS], and v: [AGSPE]. A bracket indicates aposition that can be occupied by any one of the residues in the bracket.

It is to be appreciated that not all the discovered patterns exhibitsuch clear cut functional specificity. Several of them correspond toregions (e.g., loops, coiled-coils, transmembrane) which aretraditionally considered uninteresting at least for the purposesoffunctionally annotating aprotein. Sometimes, though, even such weaksimilarities can provide useful hints for the characterization ofprotein regions. We have implemented two mechanisms that allow theexploitation of this potential. First, the user is provided with thelist of all the patterns which are matched by the query sequence. Anexpert user will, in most cases, be able to identify which patterns areof biological importance. Selection of a particular pattern leads thento a refinement of the scoring, focusing only on the areas of thedatabase covered by this pattern. Second, when the underlying data baseincludes annotations of the various database sequence regions, thisannotation is used in conjunction with the patterns for the extractionof useful information. Examples of the use of these two mechanisms aregiven in the next subsection.

B. Searching

In order to illustrate the searching phase (and to explain how it may beused), we selected two query sequences. The first is a well studied andannotated core histone 3 protein (SwissProt ID: H31_HUMAN), while thesecond is a not yet characterized ORF (SwissProt ID: YZ28_METJA) fromMethanococcus jannaschii.

H31_HUMAN

Core histones have been the object of extensive study due to theircentral role in the packaging of DNA within the cell. These smallproteins are rich in positively charged amino-acids that help them bindto the negatively charged DNA double helix, see J. D. Watson, N. H.Hopkins, J. W. Roberts, J. Steitz and A. M. Weiner, “Molecular Biologyof the Gene,” The Benjamin/Cummings Publishing Company, Fourth Edition,1987. The four core histones (H2A, H2B, H3 and H4) bind together into anoctameric construct (reminiscent of a cylindrical wedge) that providesthe substrate for 146 bps long DNA segments to wrap around, thuscreating the nucleosome complexes within the cell chromatin.

The SwissProt Rel. 34 database contains 33 sequences which are annotatedas Histones 3, among which is H31_HUMAN, the core histone 3 proteinfound in humans. The top-scoring results of searching this sequence withour homology detection tool are tabulated in FIG. 21. Next to eachsequence is given the similarity score of the highest scoring localalignment between that sequence and H31_HUMAN. The scores mentioned inFIG. 21 are obtained using the PAM 130 matrix (see M. O. Dayhoff, R. M.Schwartz and B. C. Orcutt, “A model of evolutionary change in proteins,”Atlas of Protein Sequence and Structure, 5:345-352,1978) and everymatching sequence from the database is assigned the score of its highestscoring segment.

All the 33 core Histones 3 of SwissProt Rel. 34 are correctly identifiedas homologous to H31_HUMAN. Furthermore, several other proteins(YB21_CAEEL, CENA_HUMAN, CSE4_YEAST, YL82_CAEEL, CENA_BOVIN, YMH3_CAEEL)are found to have extensive local similarities with H31_HUMAN.Inspection of the annotation for these proteins indicates that they areknown histone 3-like proteins. As a final note, H3_NARPS (a knownhistone 3) appears within the release 34 of SwissProt only as a fragmentand that is the reason that it is scored lowest in the list of results.

FIG. 22 gives a selected view (both high and low-scoring) of thealignments generated for the query sequence H31_HUMAN. In FIG. 22, localalignments of H31_HUMAN with a highly similar (H3_YEAST) and amoderately similar (CENA_HUMAN) protein are shown. For every sequence, anumber of local similarities are reported. In every such similarity, therelevant query (“Query”) and the data base sequence (“Seq”) regions arelisted one under the other having between them the resulting consensusregions. We use the character ‘+’ to indicate chemically similar aminoacids.

YZ28_METJA

H31_HUMAN is in a sense an easy test case because the database containsseveral sequences which are highly homologous to it. An interestingquestion to ask is how does our methodology fare when presented with“borderline” sequences, i.e., sequences for which no known homologyexists. In an effort to address this question, the system was presentedwith the yet not annotated sequence YZ28_METJA, an open reading framewith 1272 residues from the genome of M jannaschii.

The top scoring alignments produced by our system when presented withthe query sequence YZ28_METJA are depicted in FIG. 23. The mutationmatrix used is PAM130.

For the purposes of functional annotation of YZ28_METJA, the abovementioned results are not very enlightening as the database hits involvequite diverse proteins: the first two (NTNO_HUMAN, NTNO_BOVIN) aresodium-dependent noradrenaline transporters, while the last one(KAPL_APLCA) is a kinase.

With these questions in mind, we proceeded to a closer examination ofthe similarities between YZ28_METJA and the database sequences. For thisanalysis, every pattern matching YZ28_METJA was scrutinizedindividually. It is to be appreciated that the search phase of theinvention allows the user to select any of the patterns matched by thequery sequence at hand and focus on the local alignments induced by thatparticular pattern alone, disregarding all the other patterns. Thisfeature was employed for each of the patterns matched by YZ28_METJA. Theintention was to discover if any such pattern is specific to oneparticular protein family, thus giving clues about the functionality ofYZ28_METJA.

As it turned out, there exist three patterns (namely, the patterns“Y..S..L...DLK”, “NIL......IKL” and “I.H.DLK......D”) which are veryspecific for the kinase family. FIG. 24 describes a few among the topscoring alignments produced for the first one of them, that is, the topscoring local alignments for the query sequence YZ28_METJA induced bythe pattern “Y..S..I...DLK”. The mutation matrix used is PAM130. FIG. 25contains a complete listing of all the database sequences containingthat particular pattern. FIGS. 26 and 27 give the corresponding listingsfor the remaining two patterns. FIG. 28 provides a graphicrepresentation of: (a) the distribution of all the patterns matched byYZ28_METJA and (b) the areas covered by the three kinase-specificpatterns.

The pattern “Y..S..I...DLK” generates 24 hits within SwissProt. All ofthese proteins (with the exception of NABA_RAT, a sodium/bile acidcotransporter) are annotated as protein kinases (two of them, KD82_SCHPOand KKKl_YEAST, are characterized as putative/probable kinases) with themajority belonging or showing similarity to the serine/threonine kinasefamily. Furthermore, “Y..S..I...DLK” not only belongs to the kinasedomain of these proteins but it actually contains the active site (aminoacid D) of that domain.

In FIG. 25, SwissProt Rel. 34 sequences containing the pattern“Y..S..I...DLK” are shown. All of them are annotated as protein kinasesor probable/putative protein kinases (almost exclusively of theserine/threonine variety). The only exception is the protein NABA_RATwhich is annotated as a sodium/bile acid cotransporter.

Similar results are obtained for “NIL......IKL ”, the second of thethree patterns, are shown in FIG. 26. In this case the number ofdatabase hits is 34 and all of them (excluding two unannotated ORFs fromYeast and Mycoplasma hominis) are known (or probable) protein kinases.Again, serine/threonine kinases are the majority.

Finally, the third pattern “I.H.DLK......D” generates 30 SwissProt Rel.34 hits, all of them known or putative protein kinases. This is shown inFIG. 27. Furthermore, as in the case of the first of the three patterns,the pattern “I.H.DLK......D” includes the active site of the kinasedomain.

It is interesting to notice that all three of the aforementionedpatterns are specific instances of (parts of) the following generalpattern:

[LIVMFYC].[HY].D[LIVMFY]K..N[LIVMFYCT][LIVMFYCT][LIVMFYCT]

where the notation [XYZ] indicates a position which can be occupied byany of the residues X, Y, Z. This more general pattern is the PROSITEdatabase entry with accession number PS00108, namely, the signature ofthe serine/threonine protein kinase active site. Notice that thisPROSITE signature is too specific for picking up a kinase catalytic sitein the areas of YZ28_METJA covered by the three patterns examined above.This situation (known in the language of artificial intelligence asoverrepresentation of the training set) is typical of leaming systemstrained by a finite subset of the entire universe: there is always thedanger that the set of positive examples (in this case, the specific setof known serine/threonine kinases used by PROSITE) is biased and as aresult the features learned (here the kinase signature) while explainingthe observations are not general enough to extrapolate efficiently tonew instances of the family under consideration (i.e., there arefalsenegatives). The cure for this problem is the use of as large a trainingset as possible and this is the crux of the approach we present here.

As mentioned, FIG. 28 provides a graphic representation of: (a) thedistribution of all the patterns matched by YZ28_METJA and (b) the areascovered by the three kinase-specific patterns.

In FIG. 28(a), there are 410 patterns (among those discovered in theinformation gathering phase) matched by YZ28_METJA. A pattern “covers” aresidue position if it starts before (or at) that position and endsafter (or at) that position. The chart shows, for each residue position(x-axis), how many patterns (y-axis) cover that position. As shown inFIG. 28(b), the three kinase pattern discussed in the text match thesequence at offsets 35 (pattern “Y..S..L...DLK”), 112 (pattern“NIL......IKL”) and 1052 (pattern “I.H.DLK......D”). These offsets aredepicted here relative to the spikes of the pattern distribution in FIG.28(a).

Using Existing Annotation

Of the 410 patterns matched by YZ28_METJA, only the three patternsanalyzed above exhibit such clear cut functional specificity. This doesnot mean that the remaining 407 are useless. The kind of biologicalinference that can be drawn from a local similarity between twosequences is not always of a functional nature. Sometimes the homologyindicates preservation of structure and yet other times it mightcorrespond to functional units of a supporting role (e.g., DNA-bindingdomains) in the overall function of the sequences compared. In an effortto explore such weaker similarities, we have provided for a way toexploit the annotation available in the underlying database. In thedescription given below we assume the SwissProt annotation format.

The SwissProt data base associates with most of its sequencesannotations of sequence regions (the FT lines, see A. Bairoch and R.Apweiler, “The SWISS-PROT protein sequence data bank and its supplementTrEMBL in 1998”, Nucleic Acids Res., 26:38-42, 1998). A typical regiondescription looks like:

FT DOMAIN 528 779 PROTEIN KINASE

where the keyword “FT” indicates that this is a region description lineand the remaining line describes the region by giving its starting andending positions (from residue 528 up to and including residue 779 ofthe relevant data base sequence) and its annotation (a protein kinasedomain).

When presented with a pattern P, we can use (as already mentioned) theoffset list L_(D)(P) to locate all the sequences in the database thatmatch P. Assume that S is such a sequence and that at offset j within Sbegins a substring that matches P. If P happens to fall in an annotatedregion of S (either entirely or in part), we can associate this regionwith P. Performing this process for every sequence S matching P resultsin a set RS_(D)(P) of regions associated with P. FIG. 29 gives anexample of part of the output produced by our system for one of thethree kinase patterns described above. That is, FIG. 29 illustratesanalysis of individual patterns using the SwissProt annotation: some ofthe database sequences matching the pattern “I.H.DLK......D”. For everysuch sequence, its ID and DE lines are reported (see A. Bairoch and R.Apweiler, “The SWISS-PROT protein sequence data bank and its supplementTrEMBL in 1998”, Nucleic Acids Res., 26:38-42, 1998), giving theSwissProt name of the sequence and a short description of itsfunctionality. Next follows the offset within the sequence where thematch originates. Finally, there are the FT lines for all the annotatedregions having an intersection with the region covered by the pattern.

Given now a pattern P matched by a subsequence A of a query sequence Q,the question is how to use RS_(D)(P) in an effort to characterize A. Anumber of approaches can be used. For example, if RS_(D)(P) is largeenough and the majority of its members agree in their functionality,then it can be inferred that A is quite likely to have the samefunctionality. Another consideration is the relative lengths of thepattern P and the regions described by the FT lines. If, for example, apattern P has an extent of 15 residues while an annotated sequenceregion containing P has a length of 300 amino acids then one might notwant to transfer the annotation of that region to P. In conclusion, theend user is expected to apply his/her expertise in deciding how to bestexploit the information provided by the system.

FIG. 30 illustrates two ways to use the sets RS_(D)(P) for annotatingregions of YZ28_METJA, thus extending the picture drawn in FIG. 28(b).That is, FIG. 30 shows a characterization of various segments ofYZ28_METJA from the annotation of the patterns matched by thesesegments. The annotation of the patterns is obtained by exploiting theinformation available for the various regions of the database sequencesalso matching these patterns. The segments are shown again relative tothe spikes of the distribution of patterns over the entire YZ28_METJA.The first approach (FIG. 30(b)) assigns an annotation X (e.g.,X=transmembrane region) to a pattern P if: (i) the size of RS_(D)(P) isat least 15; (ii) the majority (80%) of the regions in RS_(D)(P) areannotated as X; and (iii) at least 50% of every region of RS_(D)(P)annotated as X is covered by P. The second approach (FIG. 30(c)) sharesthe requirements (i) and (ii) above and relaxes (iii) by allowing thepercentage of the annotated region covered by the pattern to be 30% ormore.

Performance

The running time of a homology search for a query sequence Q depends on:(i) the size of the set of patterns Π used; and (ii) the actual numberof local similarities (induced by the patterns matching Q) between Q andthe database sequences. For the case of SwissProt Rel. 34 used here,typical searches for query proteins of size around a thousand residuestake 4-6 seconds on a Pentium 266 MHz computer with 256 MB of memory. Itshould be mentioned that the running time reported above is achieved bykeeping all the program data (patterns and their offset lists) inmemory. For SwissProt, this data occupies around 200 MB.

In accordance with the various aspects of the invention, we haveprovided a methodology for performing sequence similarity searches basedon the discovery of patterns over an underlying database D of proteinsand the use of these patterns for the identification of homologiesbetween a query sequence and the proteins of the database at hand. Wedescribed a way to precisely define a set of patterns to be searchedusing statistical arguments and discussed how patterns provide moresensitivity in identifying significant homologies by introducing memoryinto the statistical computations. Finally, the utility of themethodology was exhibited using the SwissProt Rel. 34 database as a testbed and showing how the system can be used for annotating querysequences. In this context, we also discussed the potential ofexploiting the discovered patterns in conjunction with the annotation ofthe underlying database towards characterizing even weak similaritiesbetween the query and the data base sequences.

Advantageously, one aspect of the sequence homology detection system ofthe invention that sets it apart from prior art pattern based tools forhomology detection (e.g., BLOCKS) is the completeness of the set ofpatterns used. The patterns are learned in an unsupervised way from avery large training set, that of all the proteins within the underlyingdatabase D. There are no bias-creating prior assumptions on whichsequences “should” be considered as members of the same family. As aresult, the patterns discovered are expected to be more sensitive.Furthermore, by considering together sequences of distinctfunctionalities, we are able to discover weak similarities that spanfamily boundaries (e.g., patterns that describe transmembrane regions).Such similarities, although not sufficient for the inference offunctional annotations, nevertheless give useful information regardingthe role of different parts of the query sequence under examination.

Another advantage of the system of the invention is the running timesachieved for the homology searches. The speedup afforded by usingpatterns rather than scanning the entire database for every search canbecome a factor as the size of genomic databases increases ever faster(especially for users who want to run in-house tests rather than usepublic servers).

Although illustrative embodiments of the present invention have beendescribed herein with reference to the accompanying drawings, it is tobe understood that the invention is not limited to those preciseembodiments, and that various other changes and modifications may beaffected therein by one skilled in the art without departing from thescope or spirit of the invention.

What is claimed is:
 1. A computer-based method of detecting homologies between a plurality of sequences in a database and a query sequence, the method comprising the steps of: accessing patterns associated with the database, each pattern representing at least a portion of one or more sequences in the database; comparing the query sequence to the patterns to detect whether one or more portions of the query sequence are homologous to portions of the sequences of the database represented by the patterns; and generating a score for each sequence detected to be homologous to the query sequence, wherein the sequence score is based on individual scores generated in accordance with each homologous portion of the sequence detected, and the sequence score represents a degree of homology between the query sequence and the detected sequence.
 2. The method of claim 1, wherein the database includes sequences having both known and unknown sequence features.
 3. The method of claim 1, wherein the sequences represent proteins.
 4. The method of claim 1, wherein the comparing step further comprises generating a segment for each region of the database associated with a pattern that matches a portion of the query sequence.
 5. The method of claim 4, wherein the segment is represented as a set of values (i, j, k, l), wherein (j, k) represents the region of the database associated with the pattern such that k represents an offset of the j-th sequence in the database which the pattern matches, i represents the offset of the query sequence which the pattern matches, and l represents the length of the pattern.
 6. The method of claim 4, wherein the comparing step further comprises chaining together compatible segments to form an extended segment when more than one segment is generated.
 7. The method of claim 4, wherein the comparing step further comprises collecting in a set, the sequences of the database which match at least one pattern which also matches the query sequence, and respective segments generated for each sequence.
 8. The method of claim 7, wherein the scoring step further comprises assigning a score to each of the segments associated with each sequence in the set.
 9. The method of claim 8, wherein a score is assigned to each segment based on a mutation matrix.
 10. The method of claim 8, wherein the scoring step further comprises assigning a score to each sequence in the set based on the scores assigned the segments associated with the sequence.
 11. The method of claim 10, wherein the sequence score assigning step further comprises forming a directed graph wherein vertices represent the segments associated with the sequence and two vertices are connected by edges based on a relative order of respective offsets associated with the two segments represented by the vertices, the respective offsets including an offset of the query sequence and an offset of the sequence being scored.
 12. The method of claim 11, wherein a weight is assigned to each vertex based on the segment score.
 13. The method of claim 12, wherein a weight is assigned to each edge based on a difference between a displacement of the associated query sequence offsets and a displacement of the offsets of the sequence being scored.
 14. The method of claim 13, wherein the edge weight is inversely related to the size of the difference between displacements.
 15. The method of claim 14, wherein the sequence score assigning step further comprises identifying a path through the directed graph which yields the highest combined score of vertex weights and edge weights included in the path, the highest combined score representing the score for the sequence.
 16. The method of claim 4, wherein one or more of the patterns which are characterized as being of a low complexity are ignored.
 17. The method of claim 16, wherein a pattern is characterized as being of a low complexity when the pattern represents a sequence region having at least a predetermined number of repeated characters.
 18. The method of claim 16, wherein a pattern is characterized as being of a low complexity when the pattern represents a sequence region having overlapping occurrences of the same set of characters.
 19. The method of claim 16, wherein a pattern is characterized as being of a low complexity based on a variability associated with the pattern.
 20. The method of claim 19, wherein the variability of a pattern is a ratio of the number of times a character appears in a pattern over the total number of positions in the pattern covered by the character.
 21. The method of claim 16, wherein a pattern is characterized as being of a low complexity when the pattern matches the query sequence approximately in at least a predetermined number of characters to the left and the right of an offset associated with the query sequence.
 22. The method of claim 16, wherein a pattern is characterized as being of a low complexity when the pattern matches the sequence from the database approximately in at least a predetermined number of characters to the left and the right of an offset associated with the sequence from the database.
 23. A network-based method of detecting homologies between a plurality of sequences in a database accessed at a server in the network and a query sequence entered at a client device in the network, the method comprising the steps of: obtaining, at the server from the network, the query sequence entered at the client device; accessing, at the server, patterns associated with the database, each pattern representing at least a portion of one or more sequences in the database; comparing, at the server, the query sequence to the patterns to detect whether one or more portions of the query sequence are homologous to portions of the sequences of the database represented by the patterns; and generating a score, at the server, for each sequence detected to be homologous to the query sequence, wherein the sequence score is based on individual scores generated in accordance with each homologous portion of the sequence detected, and the sequence score determines a degree of homology between the query sequence and the detected sequence.
 24. The method of claim 23, wherein the network includes the Internet.
 25. The method of claim 23, wherein the database includes sequences having both known and unknown sequence features.
 26. The method of claim 23, wherein the sequences represent proteins.
 27. Apparatus for detecting homologies between a plurality of sequences in a database and a query sequence over a network, the apparatus comprising: a client device configured to enter the query sequence and transmit the query sequence over the network; and a server, coupled to the client device via the network, and configured to: (i) obtain the query sequence from the client device over the network; (ii) access patterns associated with the database, each pattern representing at least a portion of one or more sequences in the database; (iii) compare the query sequence to the patterns to detect whether one or more portions of the query sequence are homologous to portions of the sequences of the database represented by the patterns; (iv) generate a score for each sequence detected to be homologous to the query sequence, wherein the sequence score is based on individual scores generated in accordance with each homologous portion of the sequence detected, and the sequence score determines a degree of homology between the query sequence and the detected sequence; and (v) transmit at least a portion of the detection results to the client device over the network.
 28. Apparatus for detecting homologies between a plurality of sequences in a database and a query sequence, the apparatus comprising: at least one processor operative to: (i) access patterns associated with the database, each pattern representing at least a portion of one or more sequences in the database; (ii) compare the query sequence to the patterns to detect whether one or more portions of the query sequence are homologous to portions of the sequences of the database represented by the patterns; and (iii) generate a score for each sequence detected to be homologous to the query sequence, wherein the sequence score is based on individual scores generated in accordance with each homologous portion of the sequence detected, and the sequence score determines a degree of homology between the query sequence and the detected sequence.
 29. An article of manufacture for detecting homologies between a plurality of sequences in a database and a query sequence, comprising a machine readable medium containing one or more programs which when executed implement the steps of: accessing patterns associated with the database, each pattern representing at least a portion of one or more sequences in the database; comparing the query sequence to the patterns to detect whether one or more portions of the query sequence are homologous to portions of the sequences of the database represented by the patterns; and generating a score for each sequence detected to be homologous to the query sequence, wherein the sequence score is based on individual scores generated in accordance with each homologous portion of the sequence detected, and the sequence score determines a degree of homology between the query sequence and the detected sequence.
 30. A computer-based method of detecting homologies between a plurality of sequences in a database and a query sequence, the method comprising the steps of: accessing patterns associated with the database, each pattern representing at least a portion of one or more sequences in the database; comparing the accessed patterns to the query sequence to detect whether one or more portions of the query sequence are homologous to portions of the sequences of the database represented by the accessed patterns; and generating a score for each sequence detected to be homologous to the query sequence, wherein the sequence score is based on individual scores generated in accordance with each homologous portion of the sequence detected, and the sequence score represents a degree of homology between the query sequence and the detected sequence.
 31. A computer-based method of detecting homologies between a plurality of sequences in a database and a query sequence, the method comprising the steps of: accessing patterns associated with the database, each pattern representing at least a portion of one or more sequences in the database, wherein at least a portion of the patterns each have a variable length associated therewith and wherein, for one or more positions in the pattern, a nature or an identity of a corresponding portion of a sequence is unspecified; comparing the query sequence and the patterns to detect whether one or more portions of the query sequence are homologous to portions of the sequences of the database represented by the patterns; and generating a score for each sequence detected to be homologous to the query sequence, wherein the sequence score is based on individual scores generated in accordance with each homologous portion of the sequence detected, and the sequence score represents a degree of homology between the query sequence and the detected sequence. 